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Abstract 

We present a new approach to Bayesian inference that entirely avoids Markov 
chain simulation, by constructing a map that pushes forward the prior measure to 
the posterior measure. Existence and uniqueness of a suitable measure-preserving 
map is established by formulating the problem in the context of optimal transport 
theory. We discuss various means of explicitly parameterizing the map and com- 
puting it efficiently through solution of an optimization problem, exploiting gradient 
information from the forward model when possible. The resulting algorithm over- 
comes many of the computational bottlenecks associated with Markov chain Monte 
Carlo. Advantages of a map-based representation of the posterior include analytical 
expressions for posterior moments and the ability to generate arbitrary numbers of 
independent posterior samples without additional likelihood evaluations or forward 
solves. The optimization approach also provides clear convergence criteria for pos- 
terior approximation and facilitates model selection through automatic evaluation 
of the marginal likelihood. We demonstrate the accuracy and efficiency of the ap- 
proach on nonlinear inverse problems of varying dimension, involving the inference of 
parameters appearing in ordinary and partial differential equations. 

Keywords: Bayesian inference, optimal transport, measure-preserving maps, 
inverse problems, polynomial chaos, numerical optimization 



1. Introduction 

The estimation of model parameters from observations is a ubiquitous problem in 
science and engineering. Inference or "inversion" are central challenges in geophysics 
and atmospheric science, chemical kinetics, quantitative biology, and a host of addi- 
tional domains. In all these settings, observational data may be indirect, noisy, and 
limited in number or resolution. Quantifying the resulting uncertainty in parameters 
is then an essential part of the inference process. Uncertainty in model parameters in 
turn drives uncertainty in predictions; characterizing the latter is vital to the use of 
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computational models in design and decision-making, and even to subsequent rounds 
of data collection. 

The Bayesian statistical approach provides a foundation for learning from noisy 
and incomplete data, a natural mechanism for incorporating heterogeneous sources 
of information, and a complete assessment of uncertainty in model predictions [1-5]. 
Uncertain quantities are treated as random variables, and the posterior distribution, 
i.e., the probability distribution of any quantity of interest (be it a parameter or a 
model prediction) conditioned on data, represents one's state of knowledge about that 
quantity. Characterizing the posterior — simulating the distribution with samples; 
marginalizing; evaluating moments, quantiles, or credibility intervals — thus becomes 
one of the central computational challenges in Bayesian inference. 

In specialized cases (e.g., simple statistical models with conjugate priors) expec- 
tations with respect to the posterior may be evaluated in closed form. But in the 
general setting — and particularly when complex physical models enter the likelihood 
function — computational approaches are needed. By far the most widespread and ver- 
satile method for posterior simulation in this context is Markov chain Monte Carlo 
(MCMC) [2, 6-10]. MCMC requires only pointwise evaluations of an unnormalized 
density, and generates a stream of samples that can be used to evaluate posterior 
expectations. 

Despite its power and practical utility, MCMC suffers from many limitations. 
Samples generated by the algorithm are necessarily correlated; strong correlation 
among successive samples leads to smaller effective sample sizes and larger errors in 
posterior estimates [9]. An efficient MCMC algorithm endeavors to make the effective 
sample size after N steps as close to N as possible, as the posterior evaluation(s) 
required at each step may be costly. Efficient sampling in Metropolis-Hastings MCMC 
[8] rests on the design of effective proposal distributions, but this task is difficult for 
target distributions that contain strong correlations or localized structure, particularly 
in high dimensions. Improvements to MCMC proposal mechanisms are therefore the 
subject of intense current interest. A non-exhaustive list of recent methodological 
advances includes adaptivity based on past samples [11, 12], Langevin methods [13, 
14], the use of Hessian information [15] and Hessians with low-rank structure [16], 
Hamiltonian dynamics [17], the exchange of information among multiple chains [18- 
20], multi-stage proposals incorporating approximate models [21, 22], and the use of 
surrogate or reduced models to accelerate likelihood evaluations [23-27]. 

Even with these advances, MCMC remains a computationally intensive process; 
typical problems require many thousands or even millions of posterior evaluations. 
Moreover, the sampling process does not come with a clear convergence criterion, 
indicating when the chain has adequately explored the distribution or how many ini- 
tial samples to discard as "burn-in." Convergence diagnostics for MCMC are largely 
heuristic and remain something of an art [28]. Finally, the nature of MCMC is to 
represent the posterior distribution with samples; this may seem an obvious remark, 
but a sample representation immediately favors the use of Monte Carlo methods 
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for subsequent uncertainty propagation steps. For computationally intensive mod- 
els, however, sampling may be impractical. More efficient uncertainty propagation 
methods already exist [29-31] and it would be desirable to use them. Even sequential 
Monte Carlo methods [32], designed with recursive inference and forecasting in mind, 
still rely on a weighted-sample representation of the posterior. 

In this paper, we present a novel approach to Bayesian inference that relies on the 
construction of maps. We entirely avoid Markov chain simulation, and instead explic- 
itly construct a map that pushes forward the prior measure to the posterior measure. 
In other words, the map transforms a random variable X, distributed according to 
the prior, into a random variable Z, distributed according to the posterior. Existence 
and uniqueness of a monotone map is assured under rather weak conditions [33]. The 
map is actually found through solution of an optimization problem, which requires 
only the ability to evaluate the posterior density up to a normalizing constant (exactly 
as in Metropolis-Hastings MCMC). If gradients of the likelihood function or forward 
model are available, however, then we immediately take advantage of them. To make 
the optimization problem finite dimensional, the map is described by multivariate 
orthogonal polynomials; it therefore becomes a polynomial chaos expansion [29, 34] 
of the posterior. 

As a byproduct of the optimization procedure, the scheme automatically calculates 
the posterior normalizing constant, i.e., the marginal likelihood or evidence, which is 
an essential quantity in Bayesian model selection [35]. We show that the optimization 
procedure also provides an unambiguous convergence criterion, with a quantity that 
can be monitored in order to decide whether to terminate iterations or to enrich the 
polynomial space used to describe the map. With a map in hand, one can cheaply 
generate independent posterior samples by simulating from the prior. Also, as the 
map is represented by polynomials orthogonal with respect to standard measures, 
posterior moments may be evaluated algebraically. 

To place the map-based scheme in context, we note that variational Bayesian 
methods [36] also convert inference into an optimization problem, by approximat- 
ing the posterior with a simpler distribution from a parameterized family, perhaps 
also chosen to have a particular conditional independence structure. These are ap- 
proximations of the posterior (or posterior predictive) distribution function, however. 
Distributions must be chosen from particular families to facilitate optimization and 
sampling — in contrast to the present scheme, which approximates random variables 
directly. The present scheme has closer links to implicit ffitering methods [37, 38] 
for sequential data assimilation. These methods rely on a weighted-sample represen- 
tation of each probability distribution, but use maps to transport particles to high- 
probability regions of the posterior. Implicit filtering effectively uses many maps, 
though, one for each particle and assimilation step. Unlike the present scheme, maps 
are not constructed explicitly; rather one evaluates the action of each map on a single 
particle. The role of optimal transport in data assimilation has also been raised in 
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[39]. 

This paper is organized as follows: In Section 2 we present the basic formulation 
and several alternative versions of the optimization problem that yields the map (Sec- 
tions 2.3 and 2.4). Methods for solving these optimization problems are presented 
in Section 3. Section 4 presents a range of numerical results. First, we demonstrate 
our inferential approach on a linear problem where the posterior has known analyt- 
ical form (Section 4.1). Then we perform parameter inference in several nonlinear 
ODE systems with strongly non-Gaussian posteriors, and where the posterior dif- 
fers markedly from the prior (Sections 4.2 and 4.3). Finally, we demonstrate the 
approach on high-dimensional inverse problems involving the estimation of spatially 
heterogeneous coefficients in elliptic PDEs (Section 4.4). 

2. Formulation 

2.1. Bayesian framework 

We begin by describing the Bayesian inference framework and setting notation 
for the subsequent developments. Consider the task of inferring model parameters x 
from observations d. For simplicity, we will let both the parameters and observations 
be real-valued and finite-dimensional. In the Bayesian setting, the parameters and 
observations are treated as random variables. Let (O, J-", P) be a probability space, 
where is a sample space, J-" is a cr-field, and P is a probability measure on {Q,J^). 
Then the model parameters X : — i- M" are associated with a prior measure /iq on 
M", such that no{A) = F {X~'^ (A)) for A e M". We then define p{x) = duo/dx to 
be the density of X with respect to Lebesgue measure. For the present purposes, 
we assume that such a density always exists. We then define the expectation with 
respect to the prior measure as E [5f(X)] = j g{x)dfiQ{x) = J g{x)p{x)dx, for any 
Borel-measurable function g on M"^. The observational data are described by a random 
variable d taking values in MJ^. Inference requires that we define a probabilistic model 
for the data; in terms of probability densities, this model is simply p{d\x). Viewed as a 
function of x, this conditional density defines the likelihood function L{x; d) = p{d\x). 
Bayes' rule then yields the posterior probability density q{x) = p{x\d): 

g(x) = (1) 

where /3 is the evidence or marginal likelihood (3 = J L{x] d)p{x) dx. This posterior 
density q is associated with a posterior measure yU on M", such that dfi{z) = q{z)dz 
and the likelihood function is proportional to the Radon-Nikodym derivative of jj, 
with respect to /io, L{z) oc dfi/dfio [!]• 

In inverse problems [1, 3, 40], the likelihood frequently incorporates a deterministic 
function h mapping the parameters to some idealized observations. This function 
/i : M" — )■ M™' is termed the forward model. The discrepancy between the idealized 
model predictions and the actual data is often assumed to be additive: d = h{x) + e, 
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where e is a random variable. A common further assumption is that e is Gaussian, 
zero-mean, and independent of the model parameters, i.e., e ~ A^(0, S), leading to 
the following form for the likelihood function: 



where \\u\\a '■= \\A^2u\\, for any positive symmetric matrix A. 

While the Bayesian formulation of inverse problems commonly leads to likelihood 
functions that take the form of (2), we emphasize that this particular form is not an 
assumption or requirement of the map-based inference method described below. 

2.2. Inference with a map 

The core idea of our approach is to find a map that pushes forward the prior 
measure to the posterior measure. In other words, suppose that X is a random 
variable distributed according to the prior and that Z is a random variable distributed 
according to the posterior. Then we seek a map / : M" — )■ M" that satisfies the 
following constraint, where equality is in distribution: 



Equivalently, we seek a map / which pushes forward /iq to /i, i.e., for which /i = /j/iQ. 
A schematic of such a map is given in Figure 1. The map necessarily depends on the 
data and the form of the likelihood function. (In the case of (2), the map would thus 
depend on the data rf, the forward model /i, and the distribution of the observational 
noise e.) If the posterior distribution were equal to the prior distribution, an identity 
map would suffice; otherwise more complicated functions are necessary. Note that, 
for clarity, we adhere to the strict association of a random variable with a distribu- 
tion. Thus the prior and posterior distributions are associated with distinct random 
variables {X and Z, respectively) representing distinct states of knowledge about the 
same set of parameters. This view is slightly different than the usual notion of a 
single random variable for which we "update" our belief. 

Assuming that a map satisfying (3) exists and is monotone,^ the measure of 
represented by the posterior probability density g, can be transformed into a proba- 
bility density p for the input random variable X: 



where Dxf = df/dx is the Jacobian of the map. But we already have a density 
for X, namely the prior density p. This in turn defines an alternate criterion for 



monotone (non-decreasing) map f{x) on M" is one for which (x2 — xi) {f{x2) — f{x\)) > 
for every pair of distinct points xi,X2 G M". Issues of existence and monotonicity will be addressed 
in the next two subsections. 





Z = f{X), with X ~ /io, Z fi. 



(3) 
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an inferential map: the transformed density p should be equal to p, as depicted in 
Figure 2: 

p{x) = p{x). (5) 
The Bayesian inference problem can thus be cast in the following equivalent form: 

Problem 2.1. Find a monotone map f{x) such thaf 

[Lof] ix)[pof] {x 



\det D^f\ = p{x). 



(6) 



In many cases (e.g., when the likelihood function and the prior density contain 
exponentials) the following expression is more appropriate for computation than (6): 



log (L o /) + log (po f) - log (3 + log |det D^f\ - logp = 0. 



(7) 



We emphasize that despite the appearance of the posterior normalizing constant /3 in 
the problem above, our final formulation will never require knowledge of /3. Indeed, 
the value of /3 will emerge as a by-product of identifying the map. 

Instead of aiming for exact equality in (6), we will define an optimization problem 
that seeks to minimize the discrepancy between the prior density p and the approx- 
imate and map-dependent prior density p. This discrepancy can be expressed, for 
instance, as the KuUback-Leibler divergence or the Hellinger distance from p to p. 
We will begin by rewriting these discrepancies in a more convenient form. For the 
Hellinger distance Q{p,p) we have: 



Q{p,P) 



\/p{x) - \/p{x)) dx 



p{x) + p{x) — 2\/ p{x)p{x) dx 




1 -E 



(8) 



Similarly, the KuUback-Leibler (KL) divergence from p to p can be rewritten as fol- 
lows: 



Dkl {p\ \p) 



p(x) 

p[x) log — — dx = —K 
p[x) 



log 



p{X) 



(9) 



•^To simplify notation we use [L o f]{x) to denote L{f{x)). We also drop the explicit dependence 
of L on the data d. 
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Both the square root in (8) and the log in (9) are concave functions $, and Jensen's 
inequahty provides that 



E 



< $ E 



(10) 



To minimize either Helhnger distance (8) or KL divergence (9) we would therefore like 
to achieve equality above, but equality holds if and only if the ratio p/p is constant in 
X [41]. Consistent with (6), the value of this constant should be unity. Re-arranging 
equation (4), we obtain the following expression for (3 

/J^M^|M^)|detI,./|. (11) 

Taking the logarithm of this expression we obtain 

T(x; /) := log (L o /) + log (p o /) + log |det DJ\ -\ogp = log/3. (12) 

Equation (12) is the cornerstone of our analysis. It states that T(x; /) should be 
constant over the support of the prior. Setting T to be a constant suggests the 
following problem formulation: 

Problem 2.2. Find f*{x) such that 



f = argmm Var [T{X; /)] (13) 



where X ~ /^o- 



Note that this formulation is equivalent to minimizing either of the discrepancy 
measures between p and p considered above (and others as well). Moreover, if / is 
allowed to lie within a sufficiently rich function space, we know exactly what minimum 
value of Var [T] can be attained: zero. As emphasized earlier, /* is computed without 
any knowledge of the evidence /3, since the latter does not appear in T. However, /3 
can be evaluated as 

/3 = exp(E[T(X)]) . (14) 
An alternative optimization problem can obtained by observing that 



Dkl (pWP) = -E 



1 Pix) 

log 



log/3-E[T(X)]. 



Since log/3 is a constant, minimizing Z^kl {pWp) is equivalent to seeking: 

r =argmaxE[T(X;/)]. (15) 

In other words, we maximize the numerical estimate of the evidence. Unlike Prob- 
lem 2.2, this formulation does not have a known optimal value of the objective func- 
tion, but Var[T] can nonetheless be monitored to assess convergence. 
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2. 3. The optimal transport formulation 

The multivariate map / : M" — ?■ takes the following general form: 

Zi = fi{xi,X2,. ■ ■ ,Xn), i = l...n (16) 

where fi are the components of /. Using the classic measure transform literature 
[33, 42-45] one can show that a map pushing yUg forward to yU exists under relatively 
weak conditions, but is not unique. To guarantee uniqueness, extra constraints or 
penalties must be enforced. Two types of constraints are employed in this paper. The 
first, discussed in this section, is an optimal transport constraint. It can be shown 
[33], under remarkably general conditions, that the optimization problem 

mmE[||X-/(X)f] 
subject to /i = /jj/io (17) 

has a unique solution /* and that this solution is monotone. Conditions for existence 
and uniqueness essentially require that the measure /io (here, the prior) not contain 
any atoms. We restate this result more precisely as follows: 

Theorem 2.3. (After [33, 45]) Let /io and fi be Borel probability measures on M" with 
fiQ vanishing on subsets o/M" having H aus dor ff dimension less than or equal to n~l. 
Then the optimization problem (17) has a solution f that is uniquely determined fiQ- 
almost everywhere. This map is the gradient of a convex function and is therefore 
monotone. 

For a detailed proof of this theorem see, e.g., [45]. 

To be sure, our true objective is not to find the optimal transport of Theorem 2.3 
per se. Instead we want to add enough regularity to Problem 2.2 such that we can 
find a monotone map satisfying /jj/^o ~ /i (or equivalently p ^ p). To this end, we use 
the optimal transport problem (17) to motivate the following penalized objective: 

mm {Dkl {pWp) + AE - /(X)f ] } . (18) 

By analogy with Problem 2.2, (18) can be replaced by: 
Problem 2.4. Find f to solve the following optimization problem 

mm {Var [T(X)] + AE - /(X) f ] } . (19) 

Equations (18) and (19) should be understood as follows. The first term en- 
forces the measure transformation from prior to posterior,"^ while the second term is 



^See Appendix A for a discussion of the relative magnitudes of Var[r] and the KL divergence. 
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a penalty inspired by the optimal transport formulation to promote regularity and 
monotonicity in /. The magnitude of the penalty term is controlled by the multiplier 
A. In the optimization scheme to be detailed in Section 3, this multiplier is chosen 
adaptively, such that the magnitude of the penalty term is a prescribed fraction (e.g., 
1/10) of the magnitude of Var [T] or Dkl {pWP) at the start of each cycle. The stop- 
ping criterion for the optimization scheme evaluates only the first term of (18) or 
(19), since this is the term that enforces the accuracy of the posterior representation. 
We also emphasize that in every example tested below (see Section 4), the value of A 
was not a critical parameter in the optimization procedure. In fact A can be assigned 
a zero value when Var [T(X)] is sufficiently small. 

One could also, of course, revert to solving the constrained optimization prob- 
lem (17) and thus find the true L^-optimal transport. We do not pursue such a 
scheme here, however, because our core interest is in satisfying the measure transfor- 
mation condition, not in finding the optimal transport per se. 

2.4- Triangular formulation 

An alternative method for guaranteeing a unique solution is to enforce a "trian- 
gular" structure for the map. This construction is inspired by the Rosenblatt trans- 
formation [46, 47], where now the ith component of / can depend only on the first i 
inputs: 

Zi = fi{xi,...,Xi). (20) 

In this formulation, the regularity of the problem is provided by the triangular 
structure and no additional penalty term is required to ensure uniqueness. Indeed, 
[48] shows that there exists a unique monotone map / of the form (20) satisfying 
/i = /j/io-^ This map is known as the Knothe-Rosenblatt re-arrangement [45, 48]. 

The Rosenblatt re-arrangement is typically computed via an iterative procedure 
[45] that involves evaluating and inverting a series of marginalized conditional cu- 
mulative distribution functions. This procedure is computationally very intensive, 
since it involves computing a large number of high-dimensional integrals. Instead of 
explicitly computing the Rosenblatt transformation in this way, we propose to solve 
the following problem: 

Problem 2.5. Find 

f* = arg mm Var [T{X)] , (21) 

where f is subject to the structure (20). 

One could modify the optimization problem by replacing Var [T(X)] with Dkl {p\ \P) 
in (21). We thus propose to find a Rosenblatt-type re-arrangement all-at-once rather 



^This result [48] places conditions on /ig essentially requiring that the marginals of /iq have no 
atoms. A sufficient condition is that /^o be absolutely continuous with respect to Lebesgue measure. 
Also, just as in Theorem 2.3, uniqueness of the map holds /xo-alniost everywhere. 
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than component by component, and with a stopping criterion that involves the mag- 
nitude of the entire objective function being smaller than a prescribed threshold. 
We must acknowledge, however, that the current theory [48] does not preclude the 
existence of non-monotone measure-preserving maps that have the triangular struc- 
ture. Thus the question of whether minimizing Var [T(X)] subject to (20) guarantees 
monotonicity of the map remains open. In numerical tests on a wide variety of in- 
ference problems, however, the solutions we obtain using the triangular constraint 
are consistently monotone. Monotonicity is easily verified by examining det D^f; it 
should have the same sign (/iQ-a.e.) over the support of the prior. 

Note that one can exploit the triangular structure in order to express the deter- 
minant of the Jacobian of the map as a product of the diagonal terms: 

deti)./ = det| = n^«^lip^. (22) 

i=l * 

3. Solution Algorithm 

Problem 2.4 and its triangular variant are stochastic optimization problems, in 
that they involve expectations with respect to the prior distribution. They are 
also infinite-dimensional, in that / (in principle) may be an element of an infinite- 
dimensional function space. Moreover, they may involve likelihood functions L that 
are defined implicitly through the solution of forward problems h and that are com- 
putationally intensive. We will describe algorithms for the solution of these problems, 
taking into account all of the above challenges. The algorithms involve sample ap- 
proximations of the prior expectation, a flexible parameterization of /, and efficient 
gradient-based optimization. Later we describe a continuation-type method for solv- 
ing a sequence of simpler optimization problems to yield a "cascade" or sequence of 
maps. 

3.1. Monte Carlo sampling 

It is in general complicated or impossible to compute moments of T{X) using 
analytical formulae. Instead, we use Monte Carlo sampling to approximate the ex- 
pectation operator 

1 

E[r(x)]^-^r(x«) (23) 

^ i=l 

where x*^*-* ~ /io- Notice that the expectation is taken with respect to the prior, 
which is assumed to be a distribution from which one can easily generate independent 
samples. (Extensions to the method may address situations where prior sampling is 
difficult; see Section 5.) The number of samples A^^ is chosen adaptively over the 
course of the optimization algorithm, as described in Section 3.4. Moreover, the set 
is "refreshed" with a new batch of i.i.d. samples between cycles. In this sense, the 
present scheme bears strong similarities to the SAA (sample average approximation) 
approach to stochastic programming [49]. 
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3.2. Polynomial approximation of f 

As noted above, the optimization problems posed in Section 2.3 and 2.4 are 
infinite-dimensional. To make them more computationally feasible, the map f{x) 
is approximated using a finite set of basis functions. In a high dimensional infer- 
ence problem, the use of localized basis functions is in general not efficient; instead, 
a global basis is more suitable. In our implementation, we represent / using mul- 
tivariate polynomials orthogonal with respect to the prior measure; in other words, 
we write the map as a polynomial chaos expansion [29, 30, 34, 50]. Some additional 
intuition for polynomial maps is as follows. An identity map / : M" — M" yields 
a posterior that is equal to the prior. An affine map containing a diagonal linear 
transformation, i.e., where fi depends only on Xj, allows changes of location and scale 
in each component of the parameter vector, from prior to posterior, but preserves the 
form of each distribution. An affine map containing a general linear transformation 
(with each fi depending on all components of x) allows the introduction of new cor- 
relations in the posterior. Quadratic, cubic, and higher-degree maps allow even more 
complex distributional changes from prior to posterior. 

We write the polynomial expansion of / as: 

f{x) = Y,9iU^) (24) 

where i = («i,«2, ■■■■,in) £ is a multi-index, G M" are the expansion coefficients, 
and ipi are n-variate polynomials. We write each of these polynomials as 

n 

where ipi, is a univariate polynomial of order ij in the variable Xj, orthogonal with 
respect to the distribution of Xj. For simplicity, we assume here that the prior can be 
described, perhaps after some transformation, by a set of independent random vari- 
ables. That said, orthogonal polynomial expansions for dependent random variables 
with arbitrary probability measure can certainly be constructed [51]. The set of multi- 
indices J describes the truncation of the expansion. For example J = {\ : |i|^ < no} 
corresponds to a total-order expansion of degree ng, containing 

i^ = card(:r)=f " + "0=^^^ (26) 

terms. 

The orthogonal polynomial representation of the map offers several advantages, 
besides being convenient and flexible (by choice of J). First, moments of the poste- 
rior distribution — i.e., moments of f{X), with X ~ /ig — can be evaluated analytically, 
without the need for further sampling or additional forward model solves. For exam- 
ple, the posterior covariance is a weighted sum of outer products of the coefficients: 
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Cov(Z) = Xli [V'?]- Higher moments can be evaluated using known formulas 

for the expectations of products of orthogonal polynomials [30, 52]. Second, obtain- 
ing a polynomial chaos expansion of the posterior facilitates efficient propagation of 
data-conditioned uncertainty through subsequent computational models. This is a 
key step in posterior prediction and in sequential data assimilation (e.g., filtering). 
With a polynomial chaos representation in hand, a vast array of stochastic Galerkin 
and stochastic collocation methods can be employed for uncertainty propagation. 

Treating each coefficient Qi G as a column vector, we assemble the set {gi}iej 
into a matrix of size n x K, where K denotes card (JT"): 

F^=[9i 92 ... 9k]. (27) 

In the optimal transport formulation all n x K coefficients are considered the opti- 
mization variables, whereas in the triangular formulation some of these coefficients 
are fixed to zero. The map / is then then represented as 

f{x) = F^^{x) (28) 

where \E'(x) is a column vector containing every basis polynomial ipi, i E J'. 

3.3. Newton's method and nonlinear least squares 

In our implementation we use two alternative algorithms to solve the optimization 
problem: Newton's method and nonlinear least squares. In the standard Newton's 
method, we compute both the first and second derivatives of the objective function. It 
is then necessary to compute the derivatives of T{x) with respect to the optimization 
variables F: 

T{x; F) = log (L o /) + log {p o f) + log |det D.^f\ - logp 

dF \Lof df pof df J dF 

where df/dF = [In ^(-z)]. The last term is obtained as follows: 
DJ = D4F^^) = F^D,^ 
^ (log|detD,/|) = tiaceiiDJ)-' E,iD,,^) 



= trace((D,/)-'e,[D,^](z,:)) 
= [D,^]{z,:){DJ)-'e, 
^(log|detD,./|) = D,^{Djr' (29) 

where Eji is a matrix of all zeros except for a single 1 at the location (j, i) and Cj is 
a vector of zeros except for a 1 in row j . 
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The second derivatives of T can be obtained in a similar fashion. In the case of 
a Gaussian prior and Gaussian additive noise, exphcit expressions for the first and 
second derivatives are given in Appendix C. Note that to compute any derivatives of 
the hkehhood function, derivatives of the forward model are required. Using modern 
adjoint techniques, gradients and Hessian- vector products can typically be computed 
at the cost of 0(1) forward solves, independent of the dimension of the parameter 
space [16]. In many problems, however, the forward model and its derivatives can be 
accurately approximated using surrogate models [24]. 

The second derivatives of the log-determinant can be calculated explicitly as: 

(log|detD,/|) = -D^m{DJ)-\E,,D,^){DJ)-^ 



dFdFij 



= - {D,^ {Djy' e,) {[D,^ {z, :) (DJ)-') . (30) 

From the previous relations it is clear that in order to compute the Hessian of the 
log-determinant, we must first compute the matrix -Di:\E' {D^f)^^ and then use the 
outer product of appropriate column and row vectors to assemble the desired Hessian. 

Alternatively, one can use a nonlinear least squares (NLS) method to solve the 
problem. This method is particularly suited to the triangular formulation of Sec- 
tion 2.4, which does not have a penalty term. To cast the optimization objective 
as a nonlinear least squares problem, it is observed that the zero variance condition 
Var [T(X)] = is equivalent to T (x*^*-*) = constant, for x'-*^ ~ p. In other words, T is 
constant in an sense. This leads to the set of equations 

T(x»;F) = -5^r(x(^);F). (31) 

^ i=i 

If the number of optimization parameters i < nK in F is less than Ns-, then (31) is 
an overdetermined set of nonlinear equations in F, which can be solved via 

MAF = h 

pk+i ^ F'^-AF (32) 

where the rectangular matrix M G M^*^^ and the column vector b G -R^' are: 

^ dT 1 ^ dT (xO)) 

^ ' dF dF 



hii) = T(x«;F)-^f;T(x(^)). (33) 
Here, the derivative of a scalar with respect to a vector is assumed to be a row vector. 
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We have observed that in certain cases, e.g., weakly muhi- modal posteriors, con- 
vergence of the algorithm from arbitrary initial maps is more reliable if the objective 
function is modified to become 



mm { Var [T{X)] + XE 



T{X)-(3f 



(34) 



where (3 is an estimate of the evidence computed from the previous iteration. In 
this case, the nonlinear least squares formulation proceeds from the following set of 
equations: 

T(x»;F) = -^(-ly;T(x(^);F)+A/3], i = l...N,. (35) 



In general, we initialize our algorithm using the identity map, f{x) = x. Alter- 
natively, if the likelihood contains additive Gaussian noise and the prior is Gaussian, 
we linearize the forward model and compute the resulting map analytically. This 
approximate linear map is then used to initialize the algorithm. 

3.4- Algorithm structure 

We solve the optimization problem in stages, by iterating over the order of the 
polynomial representation of f{x). We typically start the algorithm with only linear 
basis functions in '^[x). After computing the best linear map, the total order of the 
polynomial basis is increased and a new corresponding map is computed — allowing 
coefficients of all the polynomial terms to be adjusted, not just those of the highest 
degree. This process is repeated until reaching a stopping criterion Var [T(X)] < 6, 
where 5 is a preset threshold. The samples used to approximate the prior expectation 
or variance are renewed each time the order of the expansion is increased. The number 
of samples A^,, is chosen such that approximations of the mean or variance obtained 
with successive sample batches lie within some predetermined relative tolerance a of 
each other, typically 5%. 

Algorithm 1 summarizes our complete approach. Notice that the multiplier A 
is used only when computing the penalized map of Problem 2.4. The value of this 
multiplier is adjusted each time the order of the expansion is increased; it is chosen 
such that the penalty term is 10% of the total value of the objective function at 
the start of the current stage. Thus A decreases as / converges towards the desired 
measure transformation. 



3.5. Composite map 

In certain problems, the map from prior to posterior can be accurately approxi- 
mated only with very high-degree polynomials. Unfortunately, the number of coeffi- 
cients in a multivariate polynomial expansion grows very rapidly with degree, leading 
to a more challenging optimization problem. 
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Algorithm 1: Structure of optimization algorithm for computing the map. 



Data: Likelihood function L, observed data, prior density p, stopping criterion 

6, sample tolerance a 
Result: Map f{x) 
f{x) = X, i = 0, and A = 1; 
no = 1; 

while Var [T{X)] > 6 do 
i = i + 1; 

Generate Ng i.i.d. samples of the prior random variable; 
if i > 1 then 

Compute Var[T(X)]; 

if \{Vai[T{X)]/V*) -l\>a then 
I Double the number of samples A^^ <(— 2Ns; 

end 
end 

Solve the optimization problem for f{x) up to desired order no, stopping 

when AF is smaller than a threshold; 
Get current Var[r(X)] and store it in V* = Var[T(X)]; 
Increase order of expansion: Uq ^ Hq + 2; 
If applicable compute a new X = O.IV* /E[\\X - f{X)f] 
end 



To overcome this difficulty, we can use a composition of maps to efficiently ap- 
proximate a high-degree transformation from prior to posterior. That is, instead of 
mapping the prior to the posterior with a single map, we introduce a sequence of 
intermediate distributions that evolve gradually from prior to posterior. With this 
sequence of distributions, we introduce a corresponding sequence of maps. Intuitively, 
this approach "relaxes" the measure transformation problem; if the input and output 
distributions are similar to each other, then the presumably the map can be well 
approximated using low-degree polynomials. Letting the number of steps approach 
infinity, one could imagine a continuous transformation from prior to posterior, for 
instance as proposed in [39]. 

Here we focus on finite sequences of intermediate distributions. There are many 
ways that one could create such a sequence: iterating over the accuracy of the forward 
model, gradually introducing components of the data vector d, or iterating over the 
scale of the noise covariance in the likelihood function. We choose the latter tactic, 
beginning with a large covariance and introducing k stages such that the variance 
of the last stage is the true S of the original inference problem: 

> > • ■ ■ > S'^ = S (36) 
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For a given noise variance E*, a composite map 

0*^f of-^o.-.o/i (37) 

is computed such that it maps the prior distribution /iq (with density p) to the inter- 
mediate posterior distribution /Xj (with density g*) corresponding to ith-stage noise 
leveL Then we can minimize, e.g., 

min Dkl (p \ \ iq' o 0') I det I ) (38) 

with a regularization term added above as needed. Here only the parameters of 
the zth map /* are optimization variables! Coefficients of the preceding maps are 
fixed, having been determined in the previous stages' optimization problems. See 
Figure 3 for a schematic. The resulting map f = (f)^ = o f^~^ o ■ ■ ■ o will have 
been computed with significantly less effort (and fewer degrees of freedom) than a 
total-order polynomial of the same maximum degree. 

One advantage of the sequential construction above is that the accuracy of an 
intermediate map, transforming the prior to any of the intermediate distributions 
g*, does not affect the accuracy of the final map. Because the prior density p always 
appears in (38), the final stage can be be computed with tight error tolerances to yield 
the appropriate measure transformation. Intermediate maps may be computed with 
looser tolerances, as they serve in a sense only to find distributions that are closer in 
shape to the actual posterior. Any error in an intermediate / can be compensated for 
by the final f'^. In our implementation, we have observed that this scheme is robust 
and that it allows for simple error control. 

Alternatively, one could think of computing the map one stage at a time — i.e., at 
each step computing a transformation /* from to by minimizing 

mmDKL{q'-'\\ {q' o f) \detDJ'\). (39) 

pi ^ ^ ' ' ' ' 

The final map is again f = ip^ = /'^o/^~^o- ■ - o/^. A possible advantage of this second 
construction is that one needs to consider only a single map at a time. Each stage 
must be solved exactly, however, as the construction depends on satisfying /Xj = 0J/io 
for all i. If 0* does not push forward /^o to /Xj, then any error incurred in the first i 
stages of the sequence will remain and corrupt the final stage. Also, because the error 
tolerances in the intermediate stages cannot be relaxed, the difficulty of computing 
any of the intermediate maps could conceivably be as great as that of computing the 
full map. For these reasons, we prefer the first construction (38) and will use it when 
demonstrating composite maps in the numerical examples (Section 4.3) below. 

4. Results 

We now demonstrate map-based inference on a series of example problems, ranging 
from parameter estimation in linear-Gaussian models (allowing verification against a 
known analytical solution) to nonlinear ODEs and high-dimensional PDEs. 
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4.1. Linear- Gaussian model 

Here we demonstrate the accuracy and convergence of our method on over- and 
under-determined hnear-Gaussian models. We evaluate both the optimal transport 
formulation and the triangular formulation. 

Consider a linear forward model h{x) = Ax, where the parameters x are endowed 
with a Gaussian prior, x ~ A^(0, Sp), and additive measurement noise e yields the 
data: 

d = h{x) + e = Ax + e. (40) 

If the noise is further assumed to be Gaussian, e ~ N{0,I1n), then the posterior is 
Gaussian, and the map / is linear and available in closed form: 

fix) = zo + Zix (41) 

with 

ZiTipZf = Tipost = {A'^Tij^^A + Tip^^ 

zo = //post = SA^S^'t^. (42) 

In addition, the evidence can be computed in closed form: 

P = exp (^-^ {d^^-^^d - /ipost^post/^post)^ V|detS^|. (43) 

A detailed derivation of the previous relations is given in Appendix B. 

We first test the two optimization formulations (the penalized optimal transport 
objective of Section 2.3 and the triangular construction of Section 2.4) on an over- 
determined inference problem with 10 parameters and 16 observations. The forward 
model, represented by A G ]R^^^^°, is randomly generated with entries of A inde- 
pendently sampled from a standard normal distribution. We use an identity prior 
covariance Sp = / and put E^v = cr^I with a = 0.06. 

With both formulations (Problem 2.4 and Problem 2.5), we observe convergence 
to a map matching (42) after 15 iterations. The evidence /3 also converges to its 
analytical value, to within machine precision. With the optimal transport formula- 
tion, we observe that our Zi differs from the symmetric matrix square root of the 
posterior covariance by roughly 5%. More precisely, the Frobenius norm || ■ ||p of the 
difference, \\Zi — Sp^g^Hp, is roughly 5% of ||SpQgt||p. When the triangular construc- 
tion is employed, the Frobenius norm oi Zi — L is less than 10~®||L||p, where L is 
the lower-triangular Cholesky factor of Epost, i-e., Spost = LL^. With the optimal 
transport formulation, we have further observed that if the map / is forced to remain 
symmetric and if the penalty factor A is gradually reduced to zero, then Zi matches 
^post with a relative error of less than 10~^. 

Next we consider a larger but under-determined linear-Gaussian problem with 100 
parameters and 8 observations, again randomly generating A. Using the triangular 
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construction, we compute the map. Convergence plots of Var [T] and the Kullback- 
Leibler divergence D^l {p\ \P) are given in Figure 4, while convergence of the evidence 
P is shown in Figure 5. The optimization iterations begin with the identity map, 
and the variance and KL divergence decrease rapidly to zero as the correct map 
is identified. (Note that KL divergence is not shown in the first five iterations of 
Figure 4, because the sample-estimated divergence is infinity.) Near convergence, it 
is observed that the value of the Kullback-Leibler divergence is approximately half 
the variance of T(X). This relation is proven in Appendix A. 

4-2. Reaction kinetics 

This example demonstrates map-based inference on a nonlinear problem that 
yields a non-Gaussian posterior, with a sharp lower bound and strong correlations. 
The problem is two-dimensional, enabling direct visualization of the map. We exam- 
ine the impact of truncating the polynomial order on convergence and monotonicity 
of the map. 

The objective of the problem is to infer the forward and reverse rates of reaction 

ki 

in the chemical system A ^ B The governing equations are as follows, with u 



representing the concentration of component A and v the concentration of component 



The initial condition is fixed at u{0) = 1 and v{0) = 0. The rate parameters ki and 
^2 are endowed with independent Gaussian prior distributions, ki ~ N{2,200^) and 
k2 ~ A^(4, 200^). The "true" parameter values are set to ki = 2, k2 = 4, and synthetic 
data consisting of noisy observations of u at times t = 2, 4, 5, 8, and 10 are generated. 
Observational noise is assumed to be i.i.d. Gaussian. Because the observations occur 
relatively late in the dynamics of the ODE initial value problem, the only information 
which can be inferred is the ratio of ki and — i-e., the equilibrium constant of the 
system. 

We compute a map using the algorithm detailed in Section 3. The map is rep- 
resented using a total-order polynomial expansion. We begin with a linear map and 
progressively increase its order; at the end of iterations with a 5th-order map (no = 5), 
the Kullback-Leibler divergence Dkl [pWp) is less than 10~^. The algorithm requires 
a total of 30 inner-loop optimization steps to reach this level of accuracy. 

Figure 6 shows 10^ samples from both the prior and posterior distributions. As 
expected, the posterior samples concentrate around the line k2 = 2ki. Also, posterior 
samples are localized in the upper left quadrant of the ki-k2 plane, which corresponds 
to stable trajectories of the ODE. In both Figures 6(a) and 6(b), sample points at 
which the determinant of the Jacobian D^f is negative are shown in red. Elsewhere 



B: 



du 

Itt 
dv 

~dt 



kiu — k2V . 



—kiu + k2V 



(44) 
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(blue points), the Jacobian determinant is positive. This sign change corresponds to 
regions of the parameter space where monotonicity of the map is lost; we note that 
all of these points are relegated to the tails of the prior distribution. Only 0.08% of 
the samples have negative determinant. 

Figures 7(a) and 7(b) show components of the map itself, via surface plots of 
fi{ki, and f2{ki, ^2). The two components of the map are nearly identical up to a 
scaling factor of two, which is to be expected, as the posterior distribution contains 
a strong correlation between the parameters. 

4-3. Genetic toggle switch 

In this example we demonstrate the composite map on a six-dimensional nonlinear 
parameter inference problem using real experimental data. We compare the accuracy 
of our results to those obtained with a standard MCMC algorithm. 

The example involves the dynamics of a genetic "toggle switch" [53]. The toggle 
switch consists of two repressible promotors arranged in a mutually inhibitory net- 
work: promoter 1 transcribes a repressor for promoter 2, while promoter 2 transcribes 
a repressor for promoter 1. Either repressor may be induced by an external chemical 
or thermal signal. Genetic circuits of this form have been implemented on E. coli 
plasmids, and the following ODE model has been proposed [53]: 



— u 



du 


ai 


~dt 


l + vl^ 


dv 


a2 


'dt 


1 + 


w 


= u(l + 



— V 

[1PTG]\ 



K 



) 



(45) 



Here u is the concentration of the first repressor and v is the concentration of the 
second repressor. [IPTG] is the concentration of IPTG, the chemical compound that 
induces the switch. At low values of [IPTG], the switch is in the 'low' state, refiected 
in low values of v; conversely, high values of [IPTG] lead to strong expression of v. 
As in [54], we would like to infer the six model parameters ai, 0:2, P, 7, rj and k. To 
this end, we employ actual experimental data^ consisting of normalized steady-state 
values of v at selected IPTG concentrations, spanning the 'low' and 'high' sides of 
the switch: [IPTG] G {lO'^, 0.6, 1, 3, 6, 10} x 10"^ 



^Data are courtesy of Dr. T. S. Gardner. 
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At steady state, the model yields the following relations 



u 



w = M 1 + 



1 + w'y 

[IPTG] \ 



K 



which can be expressed as an implicit function v = g{v, where ^ = (ai, 02, P, 7, rj, k). 
The elements of ^ are endowed with independent uniform prior distributions, centered 
at the nominal values ^0 suggested in [53]: 

^0 = (156.25, 15.6, 2.5, 1, 2.0015, 2.9618 x 10"^) . 

In other words, we have 

= ^0,^ (1 + CT,e^) (46) 

where 6* is a vector of uniformly-distributed random variables, 6i ~ f/(— 1,1), and 
entries of a are (0.20,0.15,0.15,0.15,0.30,0.20). As detailed in [54], the observational 
error is assumed Gaussian and zero-mean, with a standard deviation that depends on 
whether the expression level is low or high. This simplified error model is consistent 
with experimental observations. 

The use of uniform priors adds an extra constraint on the map: since the prior 
support is a unit hypercube (with appropriate scaling), the range of the map must 
be an improper subset of the unit hypercube, just as the support of the posterior is 
contained within the support of the prior. This constraint is difficult to satisfy when 
the map is approximated using global polynomials. To circumvent this difficulty, we 
first map the uniform random variables 6 to independent standard normal random 
variables x ~ A^(0, J) using the error function: 



6 = erf ( X 



)~f/(-l,l) (47) 



l + fxerf /(x)/V2 (4J 



Computation of the map can now proceed using a Gaussian prior on the input x. After 
computing the map f{x), the posterior random variable ^post is obtained through the 
same transformation: 

^post = ^0 

All derivatives of ^ with respect to x are computed analytically using (46) and (47), 
e.g., 

, V2 f x^\ 
_i.5,,_exp -- . (49) 
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Using the implicit expression for v, we can compute the derivatives of v with 
respect to any model parameter ^j: 

dv dg dg dv 

d^i d^i dv d^i 



dv dg ( dg 



-1 



The second derivatives are: 



d'^v dg^ ^ d'^g dv ^ f d^g dv ^ d^g \ dv ^ dg d'^v 



d^iQ d^id^j d^idv d^j \dv'^ dQ dvdQ J d^i dv d^id^j 
d'^v ( dg\^^ ( d^g d^g dv fd^gdv d'^g \ dv 



V dv) \dm, ' di.dvdi, ' Kdv^di,"" dvdi,) dij- ^^^^ 

To compute derivatives with respect to the transformed Gaussian random variables 
X, the chain rule is applied: 

dv dv d^ 
dx d^ dx 

d'^v dv d"^^ ^ f d^\^ d'^v d^ 



dx"^ d^ dx"^ \dx J d^"^ dx 

With these transformations in place, we turn to the numerical results in Figures 8- 
10. In this problem, a high-order map is required to accurately capture the posterior 
distribution, and thus we employ the composite map of Section 3.5. In particular, 
we compute the overall map using a four-stage cascade of third order maps, / = 

o ■ ■ ■ o The distributions obtained after each stage are shown in Figures 8(a)- 
8(d), using a scatter plot of output samples from each map (0^, 0^, etc). While the 
posterior is six-dimensional, these plots focus on the pairwise marginal distribution 
of «! and 7; this is the most "complex" pairwise marginal and is particularly sensitive 
to the accuracy of /. These distributions are shown on the transformed Gaussian 
domain, and thus they correspond to the first and fourth components of x: Xa^ and 
x^. For comparison. Figure 8(e) shows results obtained using a long run of delayed- 
rejection adaptive Metropohs (DRAM) MCMC [11], with 10^ samples. 

In Figures 9(a)-9(f) we show the marginal posterior probability density of each 
component of C,, obtained at the end of the four-map sequence. These densities are 
compared with densities obtained using the long MCMC run; the map-based results 
show good agreement with the latter. Figure 10 shows samples from the joint prior 
of Xa-^ and Xa2- As specified above, these parameters are independent and standard 
normal. The point of this figure is not the distribution per se, but rather the color 
labeling, which diagnoses the monotonicity of the map. The map is evaluated on 10^ 
samples from the prior. Among these samples, only 861 have Jacobian determinants 
that are negative; all the rest are positive. As in the previous example, the few samples 
with negative Jacobian determinant are concentrated in the tails of the distribution. 
Away from the tails, the map is monotone. 
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4.4- PDE- constrained inverse problems 



Examples in this section focus on high-dimensional inverse problems involving 
partial differential equations (PDEs). In particular, our objective is to estimate a 
spatially heterogeneous coefficient k appearing in an elliptic PDE, from noisy and lim- 
ited observations of the solution field [55]. We solve the problem on one-dimensional 
(Section 4.4.1) and two-dimensional (Section 4.4.2) spatial domains. We employ the 
triangular parameterization of the map, performing quantitative comparisons of com- 
putational efficiency and accuracy with MCMC for a range of data sets and observa- 
tional noise magnitudes. 

4.4-1- One- dimensional domain 

Consider a linear elliptic PDE on the unit interval S = [0, 1] 

V ■ {k{s)Vu) = -f{s) (52) 

where s E S is the spatial coordinate, V = d/ds, and / is a known source term. 
In the subsurface context, this equation describes Darcy flow through porous media, 
where k represents a permeability field and u is the pressure. We apply boundary 
conditions „ = and u\ , = 1. The source term is given by: 

as I s=0 ls=l 

f{s) = aexp (^-^{s - s„,f^ (53) 

with a = 100, Sm = 0.5, and b = 0.01. 

We place a log-normal prior on the permeability field k, 

\og[K{s,u)-Ko]^gV{0,C) (54) 

and let the covariance kernel c{s, s') of the Gaussian process have exponential form 

c (s, s') = 0"^ exp ^— — — — . (55) 

^ We use a prior standard deviation of a = 1.5 and a correlation length = 0.5, 
along with an offset kq = 0.5. Realizations of this spatial process are rough; in fact, 
they are not mean-square differentiable. 



^Note that c(s, s') is the Green's function of the differential operator 

Lr <f 1 



(56) 



with appropriate boundary conditions [56, 57). Hence the inverse covariance operator C ^ is expHc- 
itly represented by (56). 
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The problem is spatially discretized using finite differences on a uniform grid, with 
spacing As = 1/100. The log-normal permeability field is stochastically discretized 
using the Karhunen-Loeve expansion of the underlying Gaussian process: 

n 

log[K(s,u;) - Ko] ^ y^^(t)i{s)^/\iXi{uj) (57) 

i=l 

where Xi ~ iV(0, 1) are independent standard normal random variables, and 0j, Aj 
are the eigenfunctions and eigenvalues of the linear operator corresponding to the 
covariance kernel: J^c{s,s')(j)i{s)ds = Aj0(s'). We discretize the eigenfunctions on 
the same grid used to discretize the PDE solution field u. To capture 99% of the 
spatially-integrated variance of the log-normal process, we retain n = 66 Karhunen- 
Loeve modes. 

Noisy observations of u are obtained at m points in space. The noise is additive 
and i.i.d. Gaussian, such that dj = u{sj] x) + ej with Sj ~ N{0, cx^), j = 1 . . .m. The 
observational data is generated by choosing a "true" permeability field k, solving the 
full PDE model (52) to obtain the corresponding pressure field u{s), then corrupting 
u{s) at the m measurement locations with independent realizations of the noise. In 
the inference process, on the other hand, we use the polynomial chaos expansion (58) 
as the forward model. This discrepancy ensures that we do not commit an "inverse 
crime" [3]. 

Any Bayesian inference strategy, whether the map-based optimization approach 
or MCMC, requires repeated evaluations of the forward model. As suggested in [24], 
exploiting regularity in the dependence of u on the parameters x can make these 
evaluations more computationally tractable. We therefore approximate u{s] x) using 
a polynomial chaos expansion. We apply the iterative algorithm developed in [58], 
for the solution of high-dimensional stochastic PDEs, to obtain an approximation of 
u in the form: 

u{s;x) = ^Mk(s)V^k(a;) (58) 
ke/c 

where are coefficients and ip]f_{x) are multivariate Hermite polynomials, chosen 
adaptively within a very large basis set /C. 

Algorithm 1 allows the expansion order of the map f{x) to be increased in stages; 
i.e., we begin by finding a linear map, then a cubic map, and so on. Since infer- 
ence problems involving distributed parameters are typically high-dimensional (in 
the current problem / maps M^^ onto itself, for example), writing / as a total-order 
polynomial expansion in all n of its components will lead to a very large number of de- 
grees of freedom, most of which are not needed to achieve an accurate representation 
of the posterior. Instead, we refine the polynomial description of the map in a more 
finely-grained fashion. Recall that f{x) = F-^\E'(x), where \E' is a vector of orthogonal 
polynomials in Xi . . In the structure of Algorithm 1, decisions to expand the 
polynomial description of the map are made outside of the inner optimization loop. 
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after checking whether Var[T] satisfies the desired threshold S. Now, rather than rais- 
ing the polynomial degree of the entire map (e.g., ^ + 2), we choose a subset 
of the inputs {x, : i G 1} and introduce higher-degree polynomial terms in these 
variables only. The index set X initially consists of {1,2, . . . ,ii < n}. At the next 
iteration, if the variance threshold is still not satisfied, ii is replaced with a larger 
index i2, where ii < 12 < n. This process continues until the largest element in X is 
n or until Var[T] stagnates. Now X is reset and new polynomial terms of even higher 
degree, involving only Xi through added to the expansion. Then the index 

set is expanded once again. In any of these iterations, adding terms to the expansion 
is equivalent to adding rows to and to the matrix of polynomial coefficients F. 

To make this process more concrete, consider what happens in the present infer- 
ence problems. We begin by solving for the polynomial coefficients of a linear map 
(possibly subject to the triangular constraint). The optimal linear map is not able to 
reduce Var[T] below the requisite threshold. Polynomial terms of total order riQ = 3 
in the first ii = 10 (for example) components of x are then introduced, and all the 
coefficients collected in F are adjusted via optimization. This refinement of / is still 
not sufficient, so now the order-3 expansion is extended to the first ^2 = 20 (again for 
example) components of x. Extension to additional components of x is observed to 
yield little decrease in the minimal value of Var[T], so now the polynomial space is 
enriched by adding terms of total order no = 5 in the first ii components of x. Iter- 
ations continue until convergence, resulting an "adapted" map / whose components 
are a subset of a total-order expansion in Xi . . . Xn- 

Results of our algorithm are shown in Figures 11-13, where we also report com- 
parisons with MCMC We consider three cases, corresponding to different numbers of 
observations m and different noise levels a^. In Case I, we have m = 31 observations 
and a noise standard deviation of o"„ = 0.05; in Case II, we increase the number of 
data points to m = 101 and retain o"„ = 0.05; in Case III we keep m = 101 observa- 
tions and reduce the noise standard deviation to o"„ = 0.01. In all cases, the maximum 
polynomial order of the map is 5 and the optimization routine is terminated when 
Var[r] < 6 = 0.1. We use the triangular formulation and a single map (rather than 
a composite map) for these problems. 

Figures ll(a)-ll(f) plot the posterior mean and standard deviation of the log- 
permeability field as computed with optimal maps and with MCMC. The "truth" 
log-permeability field, used to generate the data, is shown in black. As expected in 
this ill-conditioned problem, only the smooth part of the permeability field can be 
reconstructed. As the number of data points increases and the noise level decreases, 
however, more features of the permeability field can be recovered. The posterior 
covariance is non-stationary, and we note that the posterior variance is much smaller 
at the right boundary, where there is a Dirichlet boundary condition, than at the 
left boundary, where a zero Neumann condition was imposed. Overall, posterior 
uncertainty decreases from Case I to Case III, reflecting additional information in 
the data. Good agreement between the map and MCMC results is observed, though 
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MCMC yields a slightly larger standard deviation in the left half of the domain. The 
current MCMC results were obtained using 10^ samples of DRAM, with the first 
5 X 10^ samples discarded as burn-in. Simple MCMC convergence diagnostics suggest 
that this is an adequate number of samples, but it is not entirely clear which set of 
results is more authoritative. Indeed, we observed that DRAM failed to converge 
in almost half the attempted runs of Case III. (These runs were initiated from the 
prior mean; differences among the attempted runs result from randomness in the 
proposals.) On the other hand, the optimal map algorithm reliably converges to the 
desired tolerance 6. 

Figure 12 shows posterior realizations for Case I and Case III, as computed with 
the map and with MCMC. Note that the realizations are different in character than 
the posterior mean; they are significantly rougher, as one would expect given the 
exponential covariance kernel in the Gaussian process prior. But the map and MCMC 
results yield posterior realizations of similar character. 

In Figure 13 we plot the posterior covariance of log {n — kq) from Case I. We choose 
this case because of the apparent discrepancy in posterior standard deviations shown 
in Figure 11(b). Figure 13(a) is a surface plot of the posterior covariance as computed 
using our map algorithm. The exponential prior covariance has a discontinuity in its 
derivative at the diagonal, smoothed slightly due to truncation at a finite number of 
Karhunen-Loeve modes. This feature is preserved in the posterior, reflecting the fact 
that posterior realizations are also rough and that data are not informative at the 
smallest scales. The overall scale of the covariance is reduced significantly from prior 
to posterior, however. While the prior covariance was stationary with o"^ = 1.5, the 
posterior shows smaller variance throughout the spatial domain. 

In Figure 13(b) we compare the contours of posterior covariance obtained with the 
map to those obtained using MCMC algorithm. The 16 contour levels range uniformly 
from —0.1 to 1.4. Contours computed using the two methods are remarkably similar. 
It should be emphasized that finding the map / enables the posterior covariance to 
be computed analytically. 

4.4-2. Two-dimensional domain 

To explore the performance of our algorithm in a more challenging setting, we 
solve the same inverse problem as in Section 4.4.1 but on a two-dimensional spatial 
domain S = [0,1]^. The governing equation is still (52), but now s = (si,S2) G S 
and V = {d/dsi,d/ds2). We apply deterministic Dirichlet boundary conditions on 
all four sides of S, with m(0,0) = 0, u{0,l) = 1, u{l,l) = —1, m(1,0) = 2 and a 
linear variation on dS between these corners. The source term / is of the form (53), 
centered at Sm = (0.5,0.5) and with width b = -\/l0/40. The equation is discretized 
on a 41 X 41 grid. 

Again we place a log-normal prior on the permeability field (54) with kq = 1, 
choosing an isotropic exponential covariance kernel c(s, s') with a = 1.25 and Lc = 
0.5. To capture 95% of the spatially integrated variance of the prior permeability 
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field, the next two cases employ n = 39 Karhunen-Loeve modes. 

Two different data configurations are considered. The first (Case A) uses m = 121 
observations randomly scattered on the spatial domain, with noise standard deviation 
an = 0.08; the second (Case B) involves m = 234 randomly scattered observations and 
noise standard deviation cr„ = 0.04. As in the one-dimensional problem, a polynomial 
chaos approximation of u{s, x) is used as the forward model for inversion. 

We first focus our analysis on the posterior distribution of the Karhunen-Loeve 
mode weights {zi}, where z = f{x) and Xi ~ A^(0, 1). Figure 14(a) shows a boxplot 
of the mode weights computed using the map. The extent of each blue box marks the 
25% and 75% quantiles of the posterior marginal of each Zi, while the vertical lines 
or "whiskers" span the entire range of the posterior samples drawn via the map. We 
also plot the posterior mean of Zi obtained with the map (circle) and with MCMC 
(an x). Finally we show the mode weights corresponding to the "true" permeability 
field used to generate the data, before the addition of observational noise. The map 
and MCMC means agree reasonably well. Comparing the results of inference with the 
true weights, it is observed that the first few modes are accurately identified, whereas 
the higher-index modes are not. This is because the higher-index modes are rougher; 
they correspond to higher-frequency components of the permeability field, which are 
smoothed by the elliptic operator and therefore difficult to identify from observations 
of u. This trend is further demonstrated by Figure 14(b), which shows the posterior 
standard deviation of each mode, computed with the map and with MCMC. As the 
mode indices increase, their posterior variances approach unity — which is exactly the 
prior variance on Xi. Thus the data contained little information to constrain the 
variability of these modes. 

Turning from analysis of the Karhunen-Loeve modes to analysis of the permeabil- 
ity field itself. Figure 15 shows the truth log-permeability field used to generate the 
data (which here reflects truncation to n = 39 Karhunen-Loeve modes). Figure 16(a) 
shows the posterior mean log-permeability obtained with the map in Case A. Given 
the smoothing character of the forward model, the presence of noise on the data, and 
mismatch between the full PDE model used to generate the noiseless data and the 
PC approximation used for inversion, we should not expect the posterior mean and 
the true permeability field to match exactly. Indeed, the posterior mean matches the 
true field in its large-scale behavior, but most of the localized or small-scale features 
are lost; the corresponding mode weights necessarily revert to their prior mean of 
zero. Figure 16(b) shows the posterior standard deviation of the log-permeability as 
computed with the map, while Figures 17(a) and 17(b) show the posterior mean and 
standard deviation fields computed with MCMC. Results obtained via the two algo- 
rithms are very similar. The computational time required to solve the optimization 
problem for the map, with tolerance 6 = 0.5, is equal to the time required for 5 x 10^ 
steps of MCMC. 

Figures 18-20 show analogous results for Case B, with roughly twice as many ob- 
servations and half the noise standard deviation of Case A. The boxplot of Karhunen- 



26 



Loeve (KL) mode weights in Figure 18(a) shows that a larger number of modes (at 
higher index) are accurately identified, compared to Case A. In Figure 18(b), the 
posterior standard deviation of the modes is less than in Case A, refiecting the ad- 
ditional information carried by the data. Figure 19(a) shows the posterior mean 
log-permeability obtained with the map for Case B; though this field is still smooth, 
more of the small-scale features of the true permeability field are captured. MCMC 
results shown in Figure 20 are quite similar to those obtained with the map. 

The MCMC results reported here were obtained using a DRAM chain of 10^ 
samples, half of which were discarded as burn-in. We make no claim that this is most 
efficient MCMC approach to this problem; certainly a carefully hand-tuned algorithm 
could yield better mixing. However, we do claim that it represents good MCMC 
performance. Because the inference problem has been transformed to the Karhunen- 
Loeve modes, which have unit variance and are uncorrelated in the prior, the adaptive 
Metropolis algorithm starts from a favorable parameterization with known scaling. 
Even with a good parameterization and an adaptive algorithm, however, MCMC 
mixing remains a challenge in these ill-posed inverse problems. Figure 21 shows the 
effective sample size (ESS) of each component of the chain corresponding to = 
5 X 10^ post burn-in samples. The effective sample size is computed by integrating 
the chain autocorrelation pi at lag i: 

ESS ^ (59) 

The ESS for most of the chain components lies between 1500 and 4000. In order 
to obtain a reasonable number of posterior samples — e.g., for an accurate estimate 
of a posterior moment, or to propagate posterior uncertainty through a subsequent 
simulation — an unreasonable number of MCMC steps is thus required. For instance, 
if one desires 10^ effectively independent posterior samples, then at least 30 million 
MCMC steps are needed (here, corresponding to about 2 days of simulation time). On 
the same computational platform and for the same problem, using the map algorithm 
to generate 10^ independent samples requires about 45 minutes of wallclock time to 
solve the optimization problem and construct /, followed by 5 minutes to pass 10^ 
prior samples through the map and generate the desired posterior samples. This 
corresponds to a factor of 50 reduction in computational time. 

We also note that when the noise standard deviation is reduced to = 0.01, then 
the adaptive MCMC algorithm fails to converge, producing a chain with near-zero 
acceptance rate regardless of how many times we rerun it (starting from the prior 
mean). On the other hand, the map algorithm has no trouble converging to the 
desired accuracy 6 in roughly 55 minutes of wallclock time (only 10 minutes more 
than that required for Case B). 

As a final example we consider a more refined stochastic discretization of the prior 
log-normal process. We now retain n = 139 Karhunen-Loeve modes, such that 97.5% 
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of the input permeability field's integrated variance is preserved. We make m = 227 
noisy observations, with a noise standard deviation of cr„ = 0.04. In this example 
we focus on comparing the computational performance of MCMC and map-based 
inference. 

Two different MCMC chains are run, each of length 10^ samples. Both chains start 
from the prior mean and proceed with adaptive random-walk proposals. It is observed 
that the burn-in period is relatively long; we thus remove half the chain and use the 
remaining 5 x 10^ samples to compute all quantities below. Each chain takes approx- 
imately 5 hours to produce. The map algorithm is run until the variance of T{X) is 
less than 1% of the mean of T{X); this corresponds to a KL divergence Dkl {pWp) 
of 1 nat. The computational time required to solve the optimization problem to this 
threshold is less than 3 hours. 

Figure 22 shows the posterior mean values of the Karhunen-Loeve modes com- 
puted using MCMC and the map. The two MCMC runs shown in Figure 22(a) differ 
significantly in their higher-index modes, indicating that these components of the 
chain mix rather poorly. Comparing the map-based posterior means with the "truth" 
shows, as usual, that the smoother modes are well identified in the posterior while 
the rougher modes are not, reverting instead to the prior. Poor mixing of the MCMC 
algorithm is also evident in Figure 23, which shows the posterior standard deviation 
of each mode weight. For mode indices larger than 10, the two MCMC runs yield 
very different standard deviations. And the standard deviations of the higher-index 
modes plateau below 0.8. The discrepancy between the chains suggests that this 
value is not credible, and that the chains are in fact not exploring the full parameter 
space (despite using 10^ samples). One would instead expect the rough highest-index 
modes to revert to the prior standard deviation of 1.0, exactly as observed in the 
map-based results. Indeed, agreement of the prior and posterior distributions on the 
high-index KL modes is a consequence of absolute continuity of the posterior with 
respect to the prior [55]. Limitations of standard MCMC algorithms in this context 
have been discussed in [59]. 

In Figure 24 we compute effective sample sizes for one of the MCMC chains 
from the previous figures. The minimum ESS over all chain components is 275; 
the median, mean, and max ESS are 543, 561, 1100, respectively. Extrapolating 
these numbers lets us determine the total computational (wallclock) time needed to 
compute any number of effectively independent samples via MCMC. Figure 25 thus 
compares the computational effort of MCMC to that required by the map. We neglect 
the computational time associated with any MCMC burn-in period, effectively giving 
MCMC a significant boost in the performance comparison. The time required by 
the MCMC algorithm grows linearly with the number of independent samples. On 
the other hand, the map requires a fixed amount of time at the outset, in order to 
solve the optimization problem, while the computational cost for each new sample 
point is almost negligible. (The latter still grows linearly with the number of samples 
but with a very small slope.) Here, if one is interested in generating fewer than 300 
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independent samples, then MCMC is faster; otherwise finding the optimal map is 
more efficient. 

Figure 26 shows a similar comparison, but uses the number of forward model eval- 
uations as a measure of computational effort, rather than wallclock time. We assume 
that derivatives of the forward model output with respect to the model parameters 
can be computed with an adjoint method, which means that the cost of computing the 
ffrst derivatives is equivalent to approximately two forward solves. Furthermore, per 
the current implementation, we assume that second derivatives of the forward model 
with respect to the parameters are not computed. In this comparison, the break-even 
point is approximately 200 samples. If the desired number of samples is smaller, then 
MCMC is more efficient; otherwise the map algorithm can offer order-of- magnitude 
reductions in computational effort. 

Finally, we note that all of the computations above have used a serial implemen- 
tation of the map-based inference approach. It should be emphasized, however, that 
the bulk of the computational effort involved in solving the optimization problem for 
the map and subsequently generating samples is embarrassingly parallel. 

5. Conclusions 

We have presented a new approach to Bayesian inference, based on the explicit 
construction of a map that pushes forward the prior measure to the posterior measure. 
The approach is a significant departure from Markov chain methods that character- 
ize the posterior distribution by generating correlated samples. Instead, the present 
approach finds a deterministic map / through the solution of an optimization prob- 
lem. Existence and uniqueness of a monotone measure-preserving map is established 
using results from optimal transport theory. We adapt these results and propose two 
alternative optimization formulations, one with an explicit regularization term in the 
objective and one that regularizes the problem by constraining the structure of /. 
The new formulation offers several advantages over previous methods for Bayesian 
computation: 

• The optimization problem provides a clear convergence criterion, namely that 
Var[T(X)] — )■ 0, with T{X) defined in (12). Monitoring this criterion can be 
used to terminate iterations or to adaptively enrich the function space used to 
describe the map, until a desired level of accuracy is reached. 

• The posterior normalizing constant, or evidence, is computed "for free" as an 
output of the optimization problem. 

• Because we describe the map using standard orthogonal polynomials, posterior 
moments may be computed analytically from the polynomial coefficients. 

• Once a map is in hand, arbitrary numbers of independent posterior samples 
may be generated with minimal computational cost, by applying the map to 
samples from the prior. 
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• While the optimization objective involves prior expectations and is thus stochas- 
tic, efficient gradient-based methods (e.g., Newton or quasi-Newton methods, 
with the full machinery of adjoints) can nonetheless be used to solve the opti- 
mization problem. 

• A sequence of low-order maps may be composed to capture the transition from 
prior to posterior; this construction allows a complex change of measure to be 
captured more economically than with a single map. 

We demonstrate the map-based approach on a range of examples. Fast convergence 
to the exact solution is observed in a linear parameter inference problem. We also 
infer parameters in a nonlinear ODE system, using real experimental data, where the 
posterior is of complicated form and differs in shape from the prior. Finally, we use 
the map to tackle several high- dimensional, ill-posed, and nonlinear inverse problems, 
involving inference of a spatially heterogeneous diffusion coefficient in an elliptic PDF. 

Overall, inference with maps proceeds with greater reliability and efficiency than 
MCMC — particularly on high-dimensional inverse problems. Speedup can be quan- 
tified in simple ways, such as counting the computational effort required to produce 
a certain number of effectively independent posterior samples. In the present prob- 
lems, the cost of computing the map with typical tolerances is equivalent to obtaining 
roughly 200 independent samples with MCMC. But these comparisons are necessarily 
insufficient, because inference with maps provides more information and more useful 
diagnostics than MCMC, as detailed above. 

Several promising avenues exist for future work. There is ample opportunity 
to improve the efficiency of the optimization algorithms. First, we note that each 
optimization step can be made embarrassingly parallel, as it relies on prior sampling. 
Among the most computationally intensive elements of the optimization procedure 
is the evaluation of the forward model h, composed with the map, on each prior 
sample. Parallelizing these computations and their corresponding adjoint solves would 
lead to immediate and substantial computational gains. More efficient optimization 
approaches may also employ importance sampling to compute the variance or mean 
of T, or introduce stochastic expansions for T itself. 

The map itself is a polynomial chaos expansion of the posterior distribution, and 
thus it is readily suited to propagation of posterior uncertainty through subsequent 
dynamics. With this posterior propagation step comes an immediate extension to 
recursive inference problems, i.e., filtering and prediction with sequential data. More- 
over, in the present work, we have focused on measure transformations from prior 
to posterior, but one could also create maps that push forward some third "base" 
measure to both the posterior and prior. Such a construction could be useful when 
it is not convenient or easy to generate independent prior samples, or if the prior is 
improper. 

There are several types of static inference problem for which the map approach 
must be further developed. The examples presented in this paper had only one 'level'; 
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extensions of map-based inference to include hyperparameters, or more generally, to 
exploit the structure of Bayesian hierarchical models, are currently ongoing. Another 
challenge arises when the posterior has bounded support, with significant probability 
mass accumulating near a boundary. The range of the map must be appropriately 
bounded in such cases. Thus far we have circumvented such problems by reparame- 
terization, transforming bounded domains into unbounded ones. This trick comes at 
a computational price, e.g., greater nonlinearity. We ask, then, how more directly to 
bound the range of the map, whether on a bounded or unbounded input domain, via 
constraints in optimization or perhaps a non-polynomial basis. 

We would also like to better understand the limiting behavior of the map in ill- 
posed and high-dimensional problems. When inferring distributed parameters with 
some meaningful correlation structure — in the elliptic inverse problem, for example — 
there is a natural ordering of the random variables and adaptive enrichment of the 
map works well. But in high-dimensional nonlinear problems with no natural order- 
ing (for instance, nonlinear ODE systems with hundreds of unknown parameters), 
a high-order polynomial expansion in all the modes is computationally prohibitive. 
Further exploration of adaptive methods, perhaps coupled with dimensionality re- 
duction, is needed. We ask, for example, whether inferential maps have a low-rank 
tensorial representation or a sparse representation in some basis. Finally, we note 
that generating multi-modal posterior distributions from unimodal priors will require 
more localized structure in the maps than global polynomial basis functions can feasi- 
bly provide. Piecewise polynomials, multi-element polynomial chaos [60], and similar 
representations may be quite useful in such problems. 
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Figure 1: Schematic of a function / that pushes forward the prior probabihty distribution (rep- 
resented by density p) to the posterior probabihty distribution (represented by density q). The 
objective of our algorithm is to efficiently compute such a function in high dimensional spaces. 
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Figure 2: Pictorial representation of the optimization scheme. A candidate map / transforms the 
posterior density q into an approximation p of the true prior density p. The map is adjusted to 
minimize the distance between p and p: when this distance is brought below a prescribed threshold, 
optimization iteration termainate and one has obtained the desired map. 
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Figure 3: A pictorial representation of the composite-map algorithm. Successive stages represent a 
gradual transition from prior to posterior, imposed, e.g., via a sequence of noise levels, or by iterating 
through the data set or the fidelity of the forward model. 



38 




Figure 4: Linear-Gaussian problem: Variance and Kullback-Leibler divergence versus iteration 
ber. The magnitude of both quantities converges to machine precision after 12 iterations. 
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Figure 5: Linear-Gaussian problem: DiflFerence between the exact evidence /? and the evidence 
E[T(X)] computed with the map algorithm. The evidence converges to the exact value in 12 itera- 
tions. 
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(a) Samples from the prior distribution. 
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(b) Corresponding samples of the posterior distribution. 

Figure 6: Reaction kinetics problem: samples from the prior and posterior distributions. Red dots 
represent samples at which the determinant of the Jacobian of the map is negative. The determinant 
is positive for 9992 out of 10000 samples. The posterior density is concentrated along a line of slope 
2, which is the ratio fc2/fci 
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(a) First component of the map. 




(b) Second component of the map. 

Figure 7: Reaction kinetics problem: two-dimensional transformation / from X ^ fj,Q (the prior 
random variable) to Z ^ fi (the posterior random variable). Due to the strong correlation between 
zi and Z2 , both figures look almost identical up to a multiplicative factor of two. 
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(a) First stage: noise level — 16S (b) Second stage: noise level Y? = SE 
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(c) Third stage: noise level Y? = 21] (d) Fourth stage: noise level S"' = S 



(e) MCMC samples 

Figure 8: Toggle switch problem: posterior samples of and computed using four stages of the 
composite map algorithm. For comparison, we also show posterior samples obtained with adaptive 
MCMC. 
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(e) Posterior pdf of rj. 



(f) Posterior pdf of k. 



Figure 9: Toggle switch problem: Marginal posterior probability density functions of each model 
parameter, computed using the map and compared with MCMC. 
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Figure 10: Toggle switch problem: colors represent the sign of the determinant of the Jacobian of 
/ (the final composite map). Only 861 out of 10^ points, shown in red, have negative Jacobian 
determinant. These points lie in the low probability region of Xa-^. 
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(a) Case I: Posterior mean of log(K — kq). (b) Case I: Posterior std of log(K — kq) 
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(c) Case II: Posterior mean of log(K — kq) (d) Case II: Posterior std of log(K — kq) 
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(e) Case III: Posterior mean of log(K — ko) (f) Case III: Posterior std of log(K — ko) 



Figure 11: One-dimensional elliptic PDE: results of the map algorithm compared to results of MCMC 
for three different data cases, detailed in the text. The cases are: (I) fewer data points and larger 
noise variance; (II) many data points and larger noise variance; (III) many data points and smaller 
noise variance. MCMC experiences significant convergence difficulties in the final case. 
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Figure 12: One-dimensional elliptic PDE: four posterior realizations (colored lines) from Case I and 
Case III, computed with the map and with MCMC. The true permeability field is shown in black 
on all figures. 
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(a) Surface plot of the posterior covariance C(si, S2), calculated with the map. 




Spatial coordinate 

(b) Map (thick contours) versus MCMC (thin contours). 

Figure 13: One-dimensional elliptic PDE: posterior covariance of the log-permeability field, Case I. 
The lower plot compares results obtained with the map against results obtained with MCMC, for a 
fixed set of contour levels. 
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(a) Boxplot of Karhunen-Loeve mode weights, obtained with the map. Superim- 
posed are posterior means obtained with the map and with MCMC, along with 
truth values of the weights. 
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(b) Posterior standard deviation of the Karhunen-Loeve mode weights, 
map versus MCMC. 



Figure 14: Two-dimensional elliptic PDE: 121 observations, (t„ = 0.08. Posterior distribution of the 
Karhunen-Loeve modes of the log-permeability, as computed with the map and with MCMC. 
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Figure 15: Two-dimensional elliptic PDE: truth log {k — kq) 
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(a) Posterior mean of log (k — kq) 




(b) Posterior standard deviation of log (k — kq) 



Figure 16: Two-dimensional elliptic PDE: 121 observations. (t„ = 0.08. Posterior mean and standard 
deviation of log (k — kq) computed with the map. 
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(a) Posterior mean of log (k — kq) 




(b) Posterior standard deviation of log (k — kq) 



Figure 17: Two-dimensional elliptic PDE: 121 observations. (t„ = 0.08. Posterior mean and standard 
deviation of log (k — kq) computed with MCMC. 
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(a) Boxplot of Karhunen-Loeve mode weights, obtained with the map. Superim- 
posed are posterior means obtained with the map and with MCMC, along with 
truth values of the weights. 
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(b) Posterior standard deviation of the Karhunen-Loeve mode weights, 
map versus MCMC. 



Figure 18: Two-dimensional elliptic PDE: 234 observations, (t„ = 0.04. Posterior distribution of the 
Karhunen-Loeve modes of the log-permeability, as computed with the map and with MCMC. 
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(b) Posterior standard deviation of log (k — kq) 



Figure 19: Two-dimensional elliptic PDE: 234 observations. (t„ — 0.04. Posterior mean and standard 
deviation of log (k — kq) computed with the map. 
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(a) Posterior mean of log (k 
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(b) Posterior standard deviation of log (k — kq) 



Figure 20: Two-dimensional elliptic PDE: 234 observations. (t„ — 0.04. Posterior mean and standard 
deviation of log (k — kq) computed with MCMC. 



55 



4500^ ' ' 

* 

4000- * 

3500 ^ * * * 

3000- * * * 

E 2500 -* / * . * * * * 

cc * * 

o 2000- ^ * ^ * * * 

o 1500- * 
UJ * 
1000- 

500 

0.: 



10 20 30 40 

KL modes 



Figure 21: Two-dimensional elliptic PDE: 234 observations, (t„ = 0.04. Effective sample size after 
5 X 10^ MCMC iterations. 
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(b) Map versus truth. 

Figure 22: Two-dimensional elliptic PDE: 227 observations, cr„ = 0.04, 139-diniensional posterior. 
Posterior mean of the Karhunen-Loeve modes as computed with MCMC and with the map. 
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Figure 23: Two-dimensional elliptic PDE: 227 observations, (7„ = 0.04, 139-dimensional posterior. 
Posterior standard deviation of the Karhunen-Loeve modes as computed with MCMC and with the 
map. 
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Figure 24: Two-dimensional elliptic PDE: 227 observations, (t„ = 0.04. 139-dimensional posterior. 
Effective sample size after 5 x 10^ MCMC iterations. 
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Figure 25: Two-dimensional elliptic PDE: 227 observations, cr„ = 0.04, 139-dimensional posterior. 
Estimated wallclock time required to generate a particular number of independent posterior samples. 
The break-even point is at approximately 400 samples. MCMC burn-in time has not been included. 
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Figure 26: Two-dimensional elliptic PDE: 227 observations, (t„ = 0.04, 139-dimensional posterior. 
Estimated number of forward model evaluations required to generate a particular number of inde- 
pendent posterior samples. The break-even point is at approximately 200 samples. MCMC burn-in 
time has not been included. 
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Appendix A. Relation between Vciriance and Kullback-Leibler divergence 
necir convergence 

When the variance of T{X) is small, the KuUback Leibler-divergence from p to 
p is small as well. In this case, the relationship between these two quantities is 
asymptotically linear. To demonstrate this, we start from the KL divergence: 



L>KL(p||p)=logE[exp(r)]-E[r]. 
For small perturbations of T around Tq = E [T] , we obtain 



(A.l) 



D 



KL 



logE 
logE 



exp(To)(l + (T-To) + -(r-To)^ 



i + (r-ro) + 2(^-^o)' 



E[7o] 



log (^l + ^Var[r]^ 
Var [T] / 2 



(A.2) 



Appendix B. Linear-Gaussian model 



The special case of a linear forward model h{x) — Ax with additive Gaussian 

noise e ~ ^'^(O, E„) and a Gaussian prior x --^ N{0, Ep) admits a closed form solution, 
where the posterior is also Gaussian. Without loss of generality, we assume a prior 
mean of zero and write the posterior density q as: 



q{z) 



^exp (-^(P.-rf|||„ + |H|y) 
= ^ exp (-i (IMIII^ + (A^E„-U + E-) . + 2d^J:-^Az)^ 
= ^exp(^-i(|M|||„ + z^E-^z-2/.^E-^^)^ (B.l) 

where ^, E, and /3 are the posterior mean, covariance, and evidence, respectively. 
Equating the second and third lines above, we obtain the following relations: 

E = {A^K'A + J:j.r' 

^ = exp(^-l(||cii||^-/x^E-V^))v1d^- (B.2) 



The map from the prior to the posterior is then given by 

f{x) = Zq^ ZxX 



(B.3) 
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where Zq = fi and the only constraint on Zi is that E = ZiUpZj . 

This result is informative because it indicates that the matrix Zi is not uniquely 
determined. Indeed, if the prior covariance is the identity, then any orthonormal 
matrix Q can result in a new map with ZiQ replacing Zi. 



Appendix C. Nonlinear model with Gaussian prior and Gaussian additive 
noise 

In this appendix we provide detailed expressions for derivatives with respect to 
optimization parameters in the case of a nonlinear forward model /i, additive Gaussian 
observational error e ~ A^(0,S„), and a Gaussian prior on the parameters xJ For 
simplicity we let the prior have zero mean and identity covariance, x ~ A^(0,/). In 
particular, we obtain derivative expressions needed to solve the optimization problem 
when the map / is represented with multivariate polynomials. 

Let p be the prior density, q the posterior density, and d the data. The problem 
setup is summarized as follows: 

p{x) oc exp ^— -x^^x 

d = h{z) + e 
q{z) = ^exp(^-l{\\h{z)-d\\l^+z^z) 

Using the map: 

z = f{x) 

we obtain the following p 

p{x) = ^exp (^-^ {\\h{f{x)) - d\\l^ + fixffix) - 2\og\detDJ\)^ 

where 

Whifix)) - d\\l^ ^ {h (fix)) - df (h (fix)) - d) 

and det D,j.f is the determinant of the Jacobian of the map. Following (12), T is given 
by 

-2T(x) = \\hU\x))-d\\l^ + ||/(x)f -21og|detD../| - ||xf (C.l) 
Assume that the map is given by the polynomial chaos expansion (28) 



f{x) = F^^(x) 



^These assumptions are typical for PDE-constrained inverse problems, and include the case of 
Gaussian priors derived from differential operators. 
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We need to compute derivatives of T{x; F) with respect to elements of the matrix F. 
Recall that n is the dimension of the model parameter vector. Hence 



dz 
dF 



z = f{x) 
dfix) 



dF 

dh{f{x)) dh{F^^) 



dF 



dF 



and 



dl 
dx 

d{\og\detDJ\) 
dF 



dh{z) dz dh 
dz dF dz 



dx 

trace ( [D^f]'^ 



{In ® 



(C.2) 
(C.3) 



d 



dFi 



[DJ] 



dx 



d^Y^ 
dx J 



(C.4) 



where the notation [A\ (:) signifies a rearrangement of the elements of the rectangular 
matrix A into a single column vector. Returning to (C.l) above, we obtain the 
following expression for the first derivatives of T evaluated at x: 



dT{x] F) 



dh{f{x)) 
dF 



T 



j:-'{h{f{x))-dy 



dfjx] 
dF 



T 



d^ 

dx 



F^ 



d^y'' 

dx J 



Using (C.2) and (C.3) 
dT{x] F) 



{hU{x))-d) + f{x) 



+ 



d^ 
dx 



dx J 



F^^ I 



(C.5) 



where derivatives with respect to z are evaluated aX z = f{x). 

To evaluate the second derivatives needed for Newton's method, one needs to 
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compute the second derivative of the logarithm of the Jacobian determinant: 
aiog|detL>./| fd^ A 

91og|detD^/| 



dFij dx 



{t,:)iDjr'e, 



9^1og|dctD./| /^^*r wn n-A (^"^i wnf^-i. 

where ej is a vector of all zeros except for a 1 in the jth location. It is clear that the 
only expensive quantity to be computed is the matrix 



BF- BF \ BF - / BF \ BF BF 



T 

K\hU(x)) - d) 



ij / ^"mn 

Notice that storage of the object B'^h{f{x))/BFBF requires special care in implemen- 
tation, since it is the second derivative of a vector- valued function with respect to a 
vector. We find that it is best stored as a third-order tensor of dimension m x £ x £, 
where m is the number of observations and £ is the number of optimization vari- 
ables. Alternatively, one could instead retain only the second derivative of the term 
\\h{z) — ci|||^ with respect to F, which can be stored in standard matrix format. For 
example, consider the special case in which the forward model h is approximated 
using the polynomial chaos expansion 

h{z) = C^^niz) 

where C is the output matrix and has m columns. The second derivative of \\h{z) — 
^lls„ 'with respect to F is computed as follows: 

(cTd%r{zlBfixlY^_,/^B^,{z)Bf{x) 



Bz BF J \ Bz BF 
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Putting everything together we arrive at the following expression for d'^T{x)/dFdF: 



T 



dFdF \\ df{x) J df{x) 

-M^^M. (C.7) 
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